In this blog, I want to discuss training and debugging a neural network (NN) in a practical manner.

I am using a cyber-troll dataset: a binary classification dataset whose labels mark each tweet as aggressive or not. This post is mostly inspired by this blog.

Whenever I train a neural network, I divide the work into the subtasks below. I am assuming you have already set your project goals and evaluation metrics.

  1. If you don't have proper data, create/collect proper data.
  2. Data preprocessing, EDA and creating structured data.
  3. Writing a basic model.
  4. Training and debugging with very small data. Try to overfit the very small training set (this surfaces basic errors so we can rectify them).
  5. Writing a better data pipeline to train the NN with the full data.
  6. Training the model with the full data and tuning the model parameters. (While training, we may face some issues, and you have to rectify those.)
  7. Comparing the model with the SOTA/any other real-time systems and trying to improve it by changing the basic model or training a new model, i.e. going back to step 3 with a new model.
  8. Comparing the results and doing the error analysis. I personally feel we can improve the model a lot based on error analysis; it is similar to the active-learning concept.
  9. If you feel data is the issue, go back to step 1 and check again.
##basic imports
import numpy as np
import pandas as pd
import random as rn
import tensorflow as tf
from tensorflow.keras.layers import LSTM, GRU, Dense, Input, Embedding
from tensorflow.keras.models import Model

Data Processing

##reading the data
cyber_troll_data = pd.read_json('Dataset for Detection of Cyber-Trolls.json', lines=True)
cyber_troll_data.head(2)
content annotation extras metadata
0 Get fucking real dude. {'notes': '', 'label': ['1']} NaN {'first_done_at': 1527503426000, 'last_updated...
1 She is as dirty as they come and that crook R... {'notes': '', 'label': ['1']} NaN {'first_done_at': 1527503426000, 'last_updated...
#basic preprocessing
cyber_troll_data['label']=cyber_troll_data.annotation.apply(lambda x: int(x['label'][0]))
cyber_troll_data = cyber_troll_data[['content', 'label']]
cyber_troll_data.head()
content label
0 Get fucking real dude. 1
1 She is as dirty as they come and that crook R... 1
2 why did you fuck it up. I could do it all day ... 1
3 Dude they dont finish enclosing the fucking sh... 1
4 WTF are you talking about Men? No men thats no... 1
#the dataset is imbalanced
cyber_troll_data.label.value_counts()
0    12179
1     7822
Name: label, dtype: int64
##splitting the data into train, validation and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cyber_troll_data.content, cyber_troll_data.label, 
                                                    test_size=0.40, stratify=cyber_troll_data.label, random_state=54)

X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, 
                                                    test_size=0.50, stratify=y_test, random_state=32)


tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(X_train)


X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_test_tokens = tokenizer.texts_to_sequences(X_test)
X_val_tokens = tokenizer.texts_to_sequences(X_val)


number_vocab = len(tokenizer.word_index)+1

X_train_pad_tokens = tf.keras.preprocessing.sequence.pad_sequences(X_train_tokens, maxlen=24, padding='post', truncating='post')
X_test_pad_tokens = tf.keras.preprocessing.sequence.pad_sequences(X_test_tokens, maxlen=24, padding='post', truncating='post')
X_val_pad_tokens = tf.keras.preprocessing.sequence.pad_sequences(X_val_tokens, maxlen=24, padding='post', truncating='post')

We have prepared the data. I am not doing thorough preprocessing and tokenization here; you can certainly preprocess in a better way.

We have:

  • X_train_pad_tokens, y_train --> to train
  • X_val_pad_tokens, y_val --> to validate and tune
  • X_test_pad_tokens, y_test --> don't use this data while training; only use it after you are done with all the modelling

I am creating training and validation datasets to iterate over using the tf.data pipeline. Please use less data for now: with a small dataset it is easier to debug and to find any errors in the network, as I will discuss below. I am only using the first 100 data points with a batch size of 32.

##creating the train dataset (only 100 data points; I will explain why after the training process)
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_pad_tokens[0:100], y_train[0:100]))
train_dataset = train_dataset.shuffle(1000).batch(32, drop_remainder=True)
train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

##creating the validation dataset using tf.data
val_dataset = tf.data.Dataset.from_tensor_slices((X_val_pad_tokens[0:100], y_val[0:100]))
val_dataset = val_dataset.batch(32, drop_remainder=True)
val_dataset = val_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)


Check the data-pairing issue, i.e. verify that the data given to the neural network is correct. If it got corrupted, check/debug the data pipeline and rectify it. If you have images, try to plot the images and check.

Below, I have written a basic for loop to print a few batches. You can also print the words corresponding to the numbers and check (see the sketch after the output below).

for input_text, output_label in train_dataset:
    print(input_text[0:3], output_label[0:3])
    break
tf.Tensor(
[[ 186   89  741    5  385   43   11  127  919 1082  157    1    9  251
     5  628    3 6970    5   11 4641   30    6   40]
 [  27    3   26   28 1021   29    6    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0]
 [4647   72  606   43   16  684  223    1    9    3 4648  923    0    0
     0    0    0    0    0    0    0    0    0    0]], shape=(3, 24), dtype=int32) tf.Tensor([0 1 1], shape=(3,), dtype=int32)
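
For instance, here is a minimal sketch that decodes a few examples back to words using the tokenizer's index_word map, so you can eyeball the text/label pairing:

##decoding a few token sequences back to words to verify the pairing by eye
for input_text, output_label in train_dataset.take(1):
    for seq, label in zip(input_text[:3].numpy(), output_label[:3].numpy()):
        #drop the padding value (0) and map each id back to its word
        words = [tokenizer.index_word[idx] for idx in seq if idx != 0]
        print(' '.join(words), '-->', label)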

Creating a Neural Network

Some rules to follow while writing/training your neural network:

  • Start with a simple architecture. We are doing text classification, so we can try a single-layer LSTM.
  • Use well-studied default parameters: activation = ReLU, optimizer = Adam, He initialization for ReLU and Glorot for sigmoid/tanh. To know more about this, please read this blog.
  • Fix the random seeds so that we can reproduce the initializations/results while tuning our models. You have to fix all the random seeds in your model.
  • Normalize the input data.

I am writing a simple LSTM model by following all the above rules.

##LSTM

##fixing numpy RS
np.random.seed(42)

##fixing tensorflow RS
tf.random.set_seed(32)

##python RS
rn.seed(12)


##model
def get_model():
    input_layer = Input(shape=(24,), name="input_layer")
    ##I am initializing the embeddings randomly, but you can use pretrained embeddings.
    x_embedd = Embedding(input_dim=number_vocab, output_dim=100, input_length=24, mask_zero=True, 
                        embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23),
                         name="Embedding_layer")(input_layer)
    
    x_lstm = LSTM(units=20, activation='tanh', recurrent_activation='sigmoid', use_bias=True, 
                 kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
                 recurrent_initializer=tf.keras.initializers.orthogonal(seed=54),
                 bias_initializer=tf.keras.initializers.zeros(), name="LSTM_layer")(x_embedd)
    
    x_out = Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45),
                  name="output_layer")(x_lstm)
    
    basic_lstm_model = Model(inputs=input_layer, outputs=x_out, name="basic_lstm_model")
    
    return basic_lstm_model


basic_lstm_model = get_model()
basic_lstm_model_anothertest = get_model()


Now I have created two models, basic_lstm_model and basic_lstm_model_anothertest. Their initial weights will be the same because of the fixed random seeds. This removes one factor of variation and is very useful for tuning parameters by experimenting on the same weight initialization.

We can check this as below.

[np.all(basic_lstm_model.get_weights()[i]==basic_lstm_model_anothertest.get_weights()[i]) \
 for i in range(len(basic_lstm_model.get_weights()))]
[True, True, True, True, True, True]

Training a NN

Loss functions - If we calculate the loss in the wrong manner, we get the wrong gradients and the network doesn't learn properly.

Some common mistakes with loss functions:

  • One of the main mistakes is giving the wrong inputs to the loss function. If you are using categorical cross-entropy, you have to give one-hot vectors as labels; otherwise, use sparse_categorical_crossentropy (no need for one-hot vectors, it takes integer labels).
  • If you are using a function that calculates the loss from unnormalized logits, don't give probability outputs to the loss function (check the from_logits parameter in the TensorFlow loss functions; a sketch follows the loss-object cell below).
  • It is useful to mask unnecessary outputs while calculating the loss. Eg: don't include the outputs at padded word positions while calculating the loss.
  • Selecting a loss function that allows the calculation of very large error values. Because of this, your loss may explode, you may get NaN, and it affects the gradients too.
##masked loss example for sequence outputs
def maskedLoss(y_true, y_pred):
    #build the mask: True wherever y_true is not the padding value (0)
    mask = tf.math.logical_not(tf.math.equal(y_true, 0))
    
    #calculating the loss
    loss_ = loss_function(y_true, y_pred)
    
    #converting mask dtype to loss_ dtype
    mask = tf.cast(mask, dtype=loss_.dtype)
    
    #applying the mask to loss
    loss_ = loss_*mask
    
    #mean over the unmasked positions only
    loss_ = tf.reduce_sum(loss_)/tf.reduce_sum(mask)
    return loss_
##creating a loss object for this classification problem
loss_function = tf.keras.losses.BinaryCrossentropy(from_logits=False, reduction='auto')
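
A quick sketch of the from_logits point from the list above: if the final Dense layer has no sigmoid, pass the raw scores and set from_logits=True so the loss applies the sigmoid internally, which is more numerically stable. The values here are made up for illustration.

##from_logits example: raw scores in, sigmoid applied inside the loss
logits_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
y_true_demo = tf.constant([[1.0], [0.0]])
raw_logits_demo = tf.constant([[2.3], [-1.1]])  #unnormalized outputs, no sigmoid applied
print(logits_loss(y_true_demo, raw_logits_demo).numpy())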

Training and validation functions

  • We have to take care of toggling the training flag because some layers (e.g. Dropout, BatchNormalization) behave differently during training and testing.
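
For example, Dropout is active only when training=True. A minimal sketch to see the difference:

##Dropout zeroes activations only when training=True
drop_demo = tf.keras.layers.Dropout(rate=0.5)
x_demo = tf.ones((1, 4))
print(drop_demo(x_demo, training=True).numpy())   #some entries zeroed, the rest scaled by 1/(1-rate)
print(drop_demo(x_demo, training=False).numpy())  #identity: all ones
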
#optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

#training function
@tf.function
def train_step(input_vector, output_vector,loss_fn):
    #taping the gradients
    with tf.GradientTape() as tape:
        #forward prop
        output_predicted = basic_lstm_model(inputs=input_vector, training=True)
        #loss calculation
        loss = loss_fn(output_vector, output_predicted)
    #getting gradients
    gradients = tape.gradient(loss, basic_lstm_model.trainable_variables)
    #applying gradients
    optimizer.apply_gradients(zip(gradients, basic_lstm_model.trainable_variables))
    return loss, output_predicted

#validation function
@tf.function
def val_step(input_vector, output_vector, loss_fn):
    #forward prop
    output_predicted = basic_lstm_model(inputs=input_vector, training=False)
    #loss calculation
    loss = loss_fn(output_vector, output_predicted)
    return loss, output_predicted

Training the NN with proper data.

  • While training the model, I suggest you don't write complex data pipelining at the start. If you do, finding the bugs in your network becomes very difficult. Just get a few instances of data (maybe 10% of your total train data if you have 10K records) into your RAM and try to train your network. In this case, I have the total data in my RAM, so I will slice a few batches and try to train the network.
  • I also suggest you leave out data augmentation for now. It is useful for regularizing the model, but try to avoid it at the start. Even if you do data augmentation, be careful about the labels. Eg: in a segmentation task, if you flip the image, you have to flip the label image as well.
  • Check for casting issues (see the sketch after this list). Eg: if a layer needs int8, give it int8 values only as input. If you have float values, just cast the dtype. If the data stored on disk is float32, load it into RAM with the same dtype.
  • Check the data-pairing issue, i.e. while giving the train data, you have to give the correct pairs of x and y.
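
A tiny sketch of the casting point (the values are hypothetical): make dtypes explicit instead of relying on implicit conversion.

##align dtypes explicitly before feeding the network
x_cast_demo = np.random.rand(4, 24)                                #NumPy defaults to float64
print(x_cast_demo.dtype)                                           #float64
x_cast_demo = tf.convert_to_tensor(x_cast_demo, dtype=tf.float32)  #cast to what the model expects
print(x_cast_demo.dtype)                                           #<dtype: 'float32'>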

Let's train the NN for 2 epochs, printing the batchwise loss and finally the mean over all batches. Even if you use the .fit method of the Keras API, it prints the aggregated value of the loss/metric as part of its verbose output. You can check that aggregation class here.

##training
EPOCHS=2

##metrics # Even if you use the .fit method, it also calculates the batchwise loss/metric and aggregates those.
train_loss = tf.keras.metrics.Mean(name='train_loss')
val_loss = tf.keras.metrics.Mean(name='val_loss')

for epoch in range(EPOCHS):
    #losses
    train_loss.reset_states()
    val_loss.reset_states()
    
    #training
    print('Batchwise Train loss')
    for text_seq, label_seq in train_dataset:
        loss_, pred_out = train_step(text_seq, label_seq, loss_function)
        print(loss_)
        train_loss(loss_)
    
    #validation
    print('Batchwise Val loss')
    for text_seq_val, label_seq_val in val_dataset:
        loss_test, pred_out_test = val_step(text_seq_val, label_seq_val, loss_function)
        print(loss_test)
        val_loss(loss_test)
    
    template = 'Epoch {}, Mean Loss: {}, Mean Val Loss: {}'
    
    print(template.format(epoch+1, train_loss.result(), val_loss.result()))
    print('-'*50)
Batchwise Train loss
tf.Tensor(0.69066906, shape=(), dtype=float32)
tf.Tensor(0.6978342, shape=(), dtype=float32)
tf.Tensor(0.7214557, shape=(), dtype=float32)
Batchwise Val loss
tf.Tensor(0.7479876, shape=(), dtype=float32)
tf.Tensor(0.6868224, shape=(), dtype=float32)
tf.Tensor(0.71952724, shape=(), dtype=float32)
Epoch 1, Mean Loss: 0.7033197283744812, Mean Val Loss: 0.7181124687194824
--------------------------------------------------
Batchwise Train loss
tf.Tensor(0.6816538, shape=(), dtype=float32)
tf.Tensor(0.69258916, shape=(), dtype=float32)
tf.Tensor(0.6689039, shape=(), dtype=float32)
Batchwise Val loss
tf.Tensor(0.744266, shape=(), dtype=float32)
tf.Tensor(0.681653, shape=(), dtype=float32)
tf.Tensor(0.71762204, shape=(), dtype=float32)
Epoch 2, Mean Loss: 0.6810489296913147, Mean Val Loss: 0.7145137190818787
--------------------------------------------------

Debugging and Enhancing NN

Till now, we have created a basic NN for our problem and trained it. Now I will discuss some hacks to debug and enhance your training process to get better results.

  • Use basic print statements and check the shapes of the input and output of every layer. This catches shape-related errors and basic output errors while creating a model. If you want to print inside TensorFlow code, please use tf.print (see the sketch after this list).
  • With eager execution, we can debug our code very easily using pdb or any IDE. You have to set tf.config.experimental_run_functions_eagerly(True) to debug your TF 2.0 functions.
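
As a small sketch of the tf.print point: inside a tf.function, Python's print fires only at trace time, while tf.print runs on every call.

##tf.print vs print inside a tf.function
@tf.function
def forward_debug_print(x):
    print('traced with static shape:', x.shape)    #runs only while tracing
    tf.print('runtime batch shape:', tf.shape(x))  #runs on every execution
    return x * 2.0

_ = forward_debug_print(tf.ones((3, 24)))
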
##LSTM

tf.config.experimental_run_functions_eagerly(True)

##fixing numpy RS
np.random.seed(42)

##fixing tensorflow RS
tf.random.set_seed(32)

##python RS
rn.seed(12)

import pdb

##model
def get_model_debug():
    input_layer_d = Input(shape=(24,), name="input_layer")
    ##I am initializing the embeddings randomly, but you can use pretrained embeddings.
    x_embedd_d= Embedding(input_dim=number_vocab, output_dim=100, input_length=24, mask_zero=True, 
                        embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23),
                         name="Embedding_layer")(input_layer_d)
    
    #LSTM
    x_lstm_d = LSTM(units=20, activation='tanh', recurrent_activation='sigmoid', use_bias=True, 
                 kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
                 recurrent_initializer=tf.keras.initializers.orthogonal(seed=54),
                 bias_initializer=tf.keras.initializers.zeros(), name="LSTM_layer")(x_embedd_d)
    
    #trace
    pdb.set_trace()
    
    x_out_d = Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45),
                  name="output_layer")(x_lstm_d)
    
    basic_lstm_model_d = Model(inputs=input_layer_d, outputs=x_out_d, name="basic_lstm_model_d")
    
    return basic_lstm_model_d


basic_model_debug = get_model_debug()

tf.config.experimental_run_functions_eagerly(False)
> <ipython-input-14-476c66b41633>(31)get_model_debug()
-> x_out_d = Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45),
{'input_layer_d': <tf.Tensor 'input_layer_2:0' shape=(None, 24) dtype=float32>, 'x_embedd_d': <tf.Tensor 'Embedding_layer_2/Identity:0' shape=(None, 24, 100) dtype=float32>, 'x_lstm_d': <tf.Tensor 'LSTM_layer_2/Identity:0' shape=(None, 20) dtype=float32>}
> <ipython-input-14-476c66b41633>(32)get_model_debug()
-> name="output_layer")(x_lstm_d)
> <ipython-input-14-476c66b41633>(34)get_model_debug()
-> basic_lstm_model_d = Model(inputs=input_layer_d, outputs=x_out_d, name="basic_lstm_model_d")

You can also debug the training loop as shown below.
For pdb instructions, please check this PDF.

My preference and suggestion is to use an IDE debugger.

##training
EPOCHS=1
##metrics # Even if you use the .fit method, it also calculates the batchwise loss/metric and aggregates those.
train_loss = tf.keras.metrics.Mean(name='train_loss')

tf.config.experimental_run_functions_eagerly(True)
for epoch in range(EPOCHS):
    train_loss.reset_states()
    
    print('Batchwise Train loss')
    for text_seq, label_seq in train_dataset:
        pdb.set_trace()
        loss_, pred_out = train_step(text_seq, label_seq, loss_function)
        print(loss_)
        train_loss(loss_)
    
    template = 'Epoch {}, Mean Loss: {}'
    
    print(template.format(epoch+1, train_loss.result()))
    print('-'*50)
tf.config.experimental_run_functions_eagerly(False)
Batchwise Train loss
> <ipython-input-15-aa3750dbfb83>(13)<module>()
-> loss_, pred_out = train_step(text_seq, label_seq, loss_function)
--Call--
> d:\softwares\anaconda3\envs\tf2\lib\site-packages\tensorflow_core\python\eager\def_function.py(551)__call__()
-> def __call__(self, *args, **kwds):
tf.Tensor(0.66431165, shape=(), dtype=float32)
> <ipython-input-15-aa3750dbfb83>(12)<module>()
-> pdb.set_trace()
tf.Tensor(0.6668887, shape=(), dtype=float32)
> <ipython-input-15-aa3750dbfb83>(13)<module>()
-> loss_, pred_out = train_step(text_seq, label_seq, loss_function)
tf.Tensor(0.6523603, shape=(), dtype=float32)
Epoch 1, Mean Loss: 0.6611868739128113
--------------------------------------------------
  • Once you are done with the creation of the model, try to train it with less data (I have taken 100 samples) and try to overfit the model to that data. To do so, we increase the capacity of our model (e.g. add layers or filters) and verify that we can reach the lowest achievable loss (e.g. zero). If your model is unable to overfit a few data points, then either it is too small (which is unlikely in today's age), or something is wrong in its structure or the learning algorithm. Check for bugs and try to remove those; I will discuss some of the bugs below. If this model works fine without any bugs, you can train with the full data.
  • TensorBoard is another important tool to debug a NN while training. You can visualize the loss, metrics, gradient/output histograms, distributions, the graph and much more. I am writing code below to plot all of these in TensorBoard.

  • As of now, we are printing/plotting the mean loss/metric over all the batches in one epoch, and based on this we are analyzing the model performance. For some loss functions/metrics this can lead to wrong conclusions. Even smoothing is not accurate: it gives an exponentially weighted average over the batchwise loss/metric. So try to compute the loss/metric over the entire train and val/test data; if you have time/space constraints, at least do it for the val/test data. Eg: the mean of cross-entropy over (equal-sized) batches equals the cross-entropy over the total data, but that does not hold for AUC/F1 score.

    Below I have written code that calculates the loss and metric (AUC) over batches and gets both the batchwise mean and the value over the whole data at once, along with better training and validation functions with TensorBoard logging. Please look into it.

##training

##model creation
basic_lstm_model = get_model()

##optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.005)

##metric
from sklearn.metrics import roc_auc_score

##train step function to train
@tf.function
def train_step(input_vector, output_vector,loss_fn):
    with tf.GradientTape() as tape:
        #forward propagation
        output_predicted = basic_lstm_model(inputs=input_vector, training=True)
        #loss
        loss = loss_fn(output_vector, output_predicted)
    #getting gradients
    gradients = tape.gradient(loss, basic_lstm_model.trainable_variables)
    #applying gradients
    optimizer.apply_gradients(zip(gradients, basic_lstm_model.trainable_variables))
    return loss, output_predicted, gradients

##validation step function
@tf.function
def val_step(input_vector, output_vector, loss_fn):
    #getting output of validation data
    output_predicted = basic_lstm_model(inputs=input_vector, training=False)
    #loss calculation
    loss = loss_fn(output_vector, output_predicted)
    return loss, output_predicted

import math

#batch size
BATCH_SIZE=32
##number of epochs
EPOCHS=10

##metrics # Even if you use the .fit method, it also calculates the batchwise loss/metric and aggregates those.
train_loss = tf.keras.metrics.Mean(name='train_loss')
val_loss = tf.keras.metrics.Mean(name='val_loss')
train_metric = tf.keras.metrics.Mean(name="train_auc")
val_metric = tf.keras.metrics.Mean(name="val_metric")

#tensorboard file writers
wtrain = tf.summary.create_file_writer(logdir='logs\\train')
wval = tf.summary.create_file_writer(logdir='logs\\val')


#number of data points / batch_size, i.e. the number of iterations in one epoch
iters = math.ceil(100/BATCH_SIZE) 

#training and validating
for epoch in range(EPOCHS):
    
    #resetting the states of the loss and metrics
    train_loss.reset_states()
    val_loss.reset_states()
    train_metric.reset_states()
    val_metric.reset_states()
    
    ##counter for train loop iteration
    counter = 0
    
    #lists to save the true and predicted values for train and validation data
    train_true = []
    train_predicted = []
    val_true = []
    val_predicted = []
    
    #iterating over the train data batch by batch
    for text_seq, label_seq in train_dataset:
        #train step
        loss_, pred_out, gradients = train_step(text_seq, label_seq, loss_function)
        #adding loss to train loss
        train_loss(loss_)
        #counting the step number
        temp_step = epoch*iters+counter
        counter = counter + 1
        
        #calculating AUC for batch
        batch_metric = roc_auc_score(label_seq, pred_out)
        train_metric(batch_metric)
        
        #appending it to list
        train_predicted.append(pred_out)
        train_true.append(label_seq)
        
        ##tensorboard 
        with tf.name_scope('per_step_training'):
            with wtrain.as_default():
                tf.summary.scalar("batch_loss", loss_, step=temp_step)
                tf.summary.scalar('batch_metric', batch_metric, step=temp_step)
        with tf.name_scope("per_batch_gradients"):
            with wtrain.as_default():
                for i in range(len(basic_lstm_model.trainable_variables)):
                    name_temp = basic_lstm_model.trainable_variables[i].name
                    tf.summary.histogram(name_temp, gradients[i], step=temp_step)
    
    #calculating the final loss and metric
    train_true = tf.concat(train_true, axis=0)
    train_predicted = tf.concat(train_predicted, axis=0)
    train_loss_final = loss_function(train_true, train_predicted)
    train_metric_auc = roc_auc_score(train_true, train_predicted)
    
    #validation data
    for text_seq_val, label_seq_val in val_dataset:
        #getting val output
        loss_val, pred_out_val = val_step(text_seq_val, label_seq_val, loss_function)
        #appending to lists
        val_true.append(label_seq_val)
        val_predicted.append(pred_out_val)
        val_loss(loss_val)
        
        #calculating metric
        batch_metric_val = roc_auc_score(label_seq_val, pred_out_val)
        val_metric(batch_metric_val)
    
    
    #calculating final loss and metric   
    val_true = tf.concat(val_true, axis=0)
    val_predicted = tf.concat(val_predicted, axis=0)
    val_loss_final = loss_function(val_true, val_predicted)
    val_metric_auc = roc_auc_score(val_true, val_predicted)
    
    #printing
    template = '''Epoch {}, Train Loss: {:0.6f}, Mean batch Train Loss: {:0.6f}, AUC: {:0.5f}, Mean batch Train AUC: {:0.5f},
    Val Loss: {:0.6f}, Mean batch Val Loss: {:0.6f}, Val AUC: {:0.5f}, Mean batch Val AUC: {:0.5f}'''
    
    print(template.format(epoch+1, train_loss_final.numpy(), train_loss.result(), 
                          train_metric_auc, train_metric.result(), val_loss_final.numpy(),
                          val_loss.result(), val_metric_auc, val_metric.result()))
    print('-'*30)
    
    #tensorboard
    with tf.name_scope("per_epoch_loss_metric"):
        with wtrain.as_default():
            tf.summary.scalar("mean_loss", train_loss.result().numpy(), step=epoch)
            tf.summary.scalar('loss', train_loss_final.numpy(), step=epoch)
            tf.summary.scalar('metric', train_metric_auc, step=epoch)
            tf.summary.scalar('mean_metric', train_metric.result().numpy(), step=epoch)
        with wval.as_default():
            tf.summary.scalar('mean_loss', val_loss.result().numpy(), step=epoch)
            tf.summary.scalar('loss', val_loss_final.numpy(), step=epoch)
            tf.summary.scalar('metric', val_metric_auc, step=epoch)
            tf.summary.scalar('mean_metric', val_metric.result().numpy(), step=epoch)
Epoch 1, Train Loss: 0.700775, Mean batch Train Loss: 0.700775, AUC: 0.46829, Mean batch Train AUC: 0.45378,
    Val Loss: 0.704532, Mean batch Val Loss: 0.704532, Val AUC: 0.48223, Mean batch Val AUC: 0.48844
------------------------------
Epoch 2, Train Loss: 0.596350, Mean batch Train Loss: 0.596350, AUC: 0.86608, Mean batch Train AUC: 0.86355,
    Val Loss: 0.691127, Mean batch Val Loss: 0.691127, Val AUC: 0.52128, Mean batch Val AUC: 0.53295
------------------------------
Epoch 3, Train Loss: 0.508518, Mean batch Train Loss: 0.508518, AUC: 0.98973, Mean batch Train AUC: 0.98923,
    Val Loss: 0.681388, Mean batch Val Loss: 0.681388, Val AUC: 0.55682, Mean batch Val AUC: 0.57112
------------------------------
Epoch 4, Train Loss: 0.441114, Mean batch Train Loss: 0.441114, AUC: 0.99554, Mean batch Train AUC: 0.99460,
    Val Loss: 0.673574, Mean batch Val Loss: 0.673574, Val AUC: 0.58578, Mean batch Val AUC: 0.60539
------------------------------
Epoch 5, Train Loss: 0.368985, Mean batch Train Loss: 0.368985, AUC: 0.99868, Mean batch Train AUC: 0.99861,
    Val Loss: 0.667929, Mean batch Val Loss: 0.667929, Val AUC: 0.61167, Mean batch Val AUC: 0.62760
------------------------------
Epoch 6, Train Loss: 0.306646, Mean batch Train Loss: 0.306646, AUC: 0.99956, Mean batch Train AUC: 1.00000,
    Val Loss: 0.664882, Mean batch Val Loss: 0.664882, Val AUC: 0.62835, Mean batch Val AUC: 0.63807
------------------------------
Epoch 7, Train Loss: 0.249700, Mean batch Train Loss: 0.249700, AUC: 1.00000, Mean batch Train AUC: 1.00000,
    Val Loss: 0.666024, Mean batch Val Loss: 0.666024, Val AUC: 0.63756, Mean batch Val AUC: 0.64217
------------------------------
Epoch 8, Train Loss: 0.195906, Mean batch Train Loss: 0.195906, AUC: 1.00000, Mean batch Train AUC: 1.00000,
    Val Loss: 0.671024, Mean batch Val Loss: 0.671024, Val AUC: 0.64063, Mean batch Val AUC: 0.64618
------------------------------
Epoch 9, Train Loss: 0.151549, Mean batch Train Loss: 0.151549, AUC: 1.00000, Mean batch Train AUC: 1.00000,
    Val Loss: 0.679804, Mean batch Val Loss: 0.679804, Val AUC: 0.64458, Mean batch Val AUC: 0.64464
------------------------------
Epoch 10, Train Loss: 0.111988, Mean batch Train Loss: 0.111988, AUC: 1.00000, Mean batch Train AUC: 1.00000,
    Val Loss: 0.695000, Mean batch Val Loss: 0.695000, Val AUC: 0.64283, Mean batch Val AUC: 0.64751
------------------------------

I trained the model for 10 epochs; the loss is decreasing and the AUC on the train data became 1 (overfit). But sometimes the model may not overfit. If it is not overfitting, there can be many reasons: the code written to create the model is incorrect, the model is not capable of learning the data, learning problems like vanishing or exploding gradients, and many more. I will discuss these problems below; they may occur even while training with the full data.

Check whether forward propagation is correct or not

While training a NN, we use vectorized implementations of data manipulation. If we make any mistake in these implementations, the training process will give bad results. We can verify this with a simple hack using the backprop dependency. Below are the steps:

  • Take a few data points. Here I am taking 5. You can take them from the data or generate random data with the same shape.
  • Do forward propagation on the model we created with this batch of data.
  • Write a loss function that takes the true values and predicted values and returns the i-th data point's prediction as the loss, where i is less than 5. I am using i = 3.
  • Do backprop and check the gradients with respect to the input data points. If you get non-zero gradients only for the i-th data point, your forward propagation is right; otherwise, there is some error in the forward propagation and you have to debug the code to find it.

In the implementation below, I have written only a basic version, without any TensorBoard/metrics; there is no need for those here.

Note: Gradients won't flow through the embedding layer (the lookup is not differentiable with respect to the integer indices), so you will get None gradients if you calculate the gradient of the loss with respect to the input. If you have an embedding layer at the start, please remove it and give the input directly to the next layer. This is easy to do because that layer can only be used as the first layer in a model.
##same model with name changes and without the embedding layer
def get_model_check():
    ##directly using an embedding dimension of 1; this is only for checking, so that's fine
    input_layer = Input(shape=(24, 1), batch_size=10, name="input_layer_debug")
    
    ##I am initializing randomly, but you can use pretrained embeddings.
    #x_embedd = Embedding(input_dim=13732, output_dim=100, input_length=24, mask_zero=True, 
                        #embeddings_initializer=tf.keras.initializers.RandomNormal(mean=0, stddev=1, seed=23),
                         #name="Embedding_layer")(input_layer)
    
    x_lstm = LSTM(units=20, activation='tanh', recurrent_activation='sigmoid', use_bias=True, 
                 kernel_initializer=tf.keras.initializers.glorot_uniform(seed=26),
                 recurrent_initializer=tf.keras.initializers.orthogonal(seed=54),
                 bias_initializer=tf.keras.initializers.zeros(), name="LSTM_layer_debug")(input_layer)
    
    x_out = Dense(1, activation='sigmoid', kernel_initializer=tf.keras.initializers.glorot_uniform(seed=45),
                  name="output_layer_debug")(x_lstm)
    
    basic_model_debug = Model(inputs=input_layer, outputs=x_out, name="basic_lstm_model_debug")
    
    return basic_model_debug

basic_model_debug = get_model_check()

##generate 5 random data points of shape (24, 1), i.e. 24 time steps and a 1-dim embedding
temp_features = np.random.randint(low=1, high=5, size=(5,24, 1))

##generate a random output of 0 or 1. These are actually unused because
#we will calculate the loss only from the predicted values
temp_outs = np.random.randint(0, 2, size=(5,1))

def loss_to_ckgrads(y_true, y_pred):
    #y_pred is one-dimensional, so we can directly return one data point's prediction as the loss.
    #I am returning the 3rd data point's prediction, so we should get non-zero gradients only for the 3rd data point.
    #if your prediction is a sequence, sum all of the i-th data point's predictions and return that.
    return y_pred[2]

def get_gradient(model, x_tensor):
    #taping the gradients
    with tf.GradientTape() as tape:
        #explicitly tell the tape to watch the input tensor; by default it does not watch inputs,
        #only trainable variables (weights)
        tape.watch(x_tensor)
        #model predictions
        preds = model(x_tensor)
        #getting the loss
        loss = loss_to_ckgrads(temp_outs, preds)
    #getting the gradients    
    grads = tape.gradient(loss, x_tensor)
    return grads
##making temp_features a tf.Variable so we can take gradients with respect to it
temp_features = tf.Variable(tf.convert_to_tensor(temp_features, dtype=tf.float32))
##
grads = get_gradient(basic_model_debug, temp_features)
for i in grads:
    #checking whether all the gradients are zero
    #all except the 3rd should be zero, i.e. print True
    print(all(i==0))
True
True
False
True
True

If you are not getting all True except at the i-th position, you have an issue in your code. You have to find and resolve it; don't move on to the next step without doing so.

What to do when the Loss Explodes

While training a NN, you may get a NaN/inf loss because of very large or very small values. Below are some causes:

  • Numerical stability issues.
    • Check the multiplications: if you are multiplying many tensors at once, apply a log and turn the product into a sum.
    • Check the division operations for divisions by zero. Try adding a small constant like 1e-12 to the denominator.
    • Check the softmax function. If your vocab size is very large, try not to apply the softmax explicitly; calculate the loss from the logits instead.
  • If the updates to the weights are very large, you may get numerical instability and the loss may explode.
    • Check the learning rate. If the learning rate is high, you may get this problem.
    • Check for the exploding-gradient problem. In TensorBoard, you can visualize the gradient histograms and spot it. If gradients are exploding, try to clip them: you can apply tf.clip_by_norm or tf.clip_by_value to your gradients after getting them from the GradientTape (see the sketch after this list).
  • It may occur because of a poor choice of loss function, i.e. one that allows the calculation of very large error values.
  • It may occur because of poor data preparation, i.e. allowing large differences in the target variables.
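
As a sketch of the gradient-clipping fix mentioned above (the clip value of 5.0 is an arbitrary assumption to tune; basic_lstm_model, optimizer and loss_function are reused from the earlier cells):

##train step with per-tensor gradient clipping
@tf.function
def train_step_clipped(input_vector, output_vector, loss_fn, clip_norm=5.0):
    with tf.GradientTape() as tape:
        output_predicted = basic_lstm_model(inputs=input_vector, training=True)
        loss = loss_fn(output_vector, output_predicted)
    gradients = tape.gradient(loss, basic_lstm_model.trainable_variables)
    #clip each gradient tensor to a maximum L2 norm before applying it
    clipped = [tf.clip_by_norm(g, clip_norm) for g in gradients]
    optimizer.apply_gradients(zip(clipped, basic_lstm_model.trainable_variables))
    return loss, output_predicted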

What to do when the Loss Increases

While training a NN, the loss may sometimes increase. Below are some causes:

  • Check the learning rate. If the learning rate is high, you may get this problem.
  • Check for a wrong loss function, especially the sign of the loss function.
  • Activation functions applied over the wrong dimensions (you can find this out using the forward-propagation check described above; see also the sketch after this list).
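
A quick sketch of the wrong-dimension point: softmax must be applied over the class axis. The logits here are random, for illustration only.

##softmax over the wrong axis silently produces nonsense probabilities
logits_demo = tf.random.normal((4, 3))         #batch of 4, 3 classes
right = tf.nn.softmax(logits_demo, axis=-1)    #each row sums to 1
wrong = tf.nn.softmax(logits_demo, axis=0)     #each column sums to 1 -- a bug
print(tf.reduce_sum(right, axis=-1).numpy())   #[1. 1. 1. 1.]
print(tf.reduce_sum(wrong, axis=-1).numpy())   #row sums are not 1 -- the bug shows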

What to do when the Loss Oscillates

While training a NN, the loss may oscillate. Below are some causes:

  • Check the learning rate. If the learning rate is high, you may get this problem.
  • Sometimes it occurs because of the exploding-gradient problem, so check for that one as well. You can check this using TensorBoard.
  • It may occur due to data-pairing issues/data corruption. We already discussed this, so make sure to feed proper data.

What to do when the Loss is Constant

While training a NN, the loss may stay constant. Below are some causes:

  • If the updates to the weights are very small, you may keep ending up in the same position.
    • Check the learning rate. If the learning rate is low, the weights won't update much, so you may get this problem.
    • Check for the vanishing-gradient problem. In TensorBoard, you can visualize the gradient histograms and spot it.
      • You can mitigate this by changing the activations to ReLU/leaky ReLU.
      • You can add skip connections for an easier flow of gradients.
      • If you have long sequences in an RNN, you can divide them into smaller ones and train with stateful LSTMs (truncated backprop).
      • Better weight initialization may reduce this.
  • Too much regularization may also cause this.
  • If you are using ReLU activations, it may occur due to dead neurons (see the sketch after this list).
  • Incorrect inputs to the loss function. I already discussed this in the loss-functions section.
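
A small sketch for the dead-ReLU point: count the units whose ReLU output is exactly zero for every example in a batch. The activations here are synthetic, shifted down to make some units die.

##fraction of "dead" ReLU units across a batch
activations = tf.nn.relu(tf.random.normal((64, 128)) - 3.0)  #toy post-ReLU activations
dead = tf.reduce_all(tf.equal(activations, 0.0), axis=0)     #True where a unit is zero for the whole batch
print('fraction of dead units:', tf.reduce_mean(tf.cast(dead, tf.float32)).numpy())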

What if we get Memory Errors

While training a NN, many people face memory-exhaustion errors because of computing constraints.

  • If you are getting a GPU out-of-memory error, try reducing the batch size and then train the neural network.
  • If your data doesn't fit into the RAM you have, try to create a data pipeline using tf.data or Keras/Python data generators and load the data batchwise (see the sketch after this list). My personal choice is tf.data pipelines. Please check this blog to know more about it.
  • Also check for duplicate operations, like creating multiple models or storing temporary variables in GPU memory.
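
Here is a sketch of streaming the data from disk instead of loading it all into RAM, using a Python generator wrapped by tf.data (the chunk size is an arbitrary assumption, and tokenization/padding is omitted for brevity):

##stream raw rows from disk in chunks instead of loading everything at once
def row_generator():
    reader = pd.read_json('Dataset for Detection of Cyber-Trolls.json',
                          lines=True, chunksize=1000)
    for chunk in reader:
        for _, row in chunk.iterrows():
            yield row['content'], int(row['annotation']['label'][0])

stream_dataset = tf.data.Dataset.from_generator(
    row_generator, output_types=(tf.string, tf.int32))
stream_dataset = stream_dataset.batch(32).prefetch(tf.data.experimental.AUTOTUNE)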

What if we Underfit the Data

Some suggestions, in decreasing order of priority:

  • Make your model bigger
  • Reduce/remove regularization (L1/L2/dropout), if any
  • Do error analysis; based on it, try to change the preprocessing/data if needed
  • Read technical papers and choose state-of-the-art models
  • Tune hyperparameters
  • Add custom features if needed

What if we Overfit the Data

Some suggestions, in decreasing order of priority:

  • Add more training data
  • Add normalization layers (batch norm, layer norm)
  • Add data augmentation
  • Increase regularization (see the sketch after this list)
  • Do error analysis; based on it, try to change the preprocessing/data if needed
  • Choose a different model
  • Tune hyperparameters
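
For instance, the regularization point can be a one-layer change to the earlier get_model(): a sketch with a dropout layer before the output head (the rate of 0.3 is an assumption to tune).

##get_model() with dropout added before the output layer
def get_model_regularized(dropout_rate=0.3):
    input_layer = Input(shape=(24,), name="input_layer")
    x_embedd = Embedding(input_dim=number_vocab, output_dim=100, mask_zero=True,
                         name="Embedding_layer")(input_layer)
    x_lstm = LSTM(units=20, name="LSTM_layer")(x_embedd)
    x_drop = tf.keras.layers.Dropout(dropout_rate, seed=7)(x_lstm)  #regularization
    x_out = Dense(1, activation='sigmoid', name="output_layer")(x_drop)
    return Model(inputs=input_layer, outputs=x_out, name="lstm_with_dropout")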

You can check some of my other blogs at this link. This is my LinkedIn and GitHub.